BANDWIDTH SELECTION FOR SMOOTH BACKFITTING IN ADDITIVE MODELS

By Enno Mammen and Byeong U. Park

University of Mannheim and Seoul National University

The smooth backfitting introduced by Mammen, Linton and Nielsen 
[Ann. Statist. 27 (1999) 1443-1490] is a promising technique to fit ad- 
ditive regression models and is known to achieve the oracle efficiency 
bound. In this paper, we propose and discuss three fully automated 
bandwidth selection methods for smooth backfitting in additive mod- 
els. The first one is a penalized least squares approach which is based 
on higher-order stochastic expansions for the residual sums of squares 
of the smooth backfitting estimates. The other two are plug-in band- 
width selectors which rely on approximations of the average squared 
errors and whose utility is restricted to local linear fitting. The large 
sample properties of these bandwidth selection methods are given. 
Their finite sample properties are also compared through simulation 
experiments. 

1. Introduction. Nonparametric additive models are a powerful tech- 
nique for high-dimensional data. They avoid the curse of dimensionality 
and allow for accurate nonparametric estimates also in high-dimensional 
settings; see Stone [20] among others. On the other hand, the models are 
very flexible and allow for informative insights on the influences of different 
covariates on a response variable. This is the reason for the popularity of 
this approach. Estimation in this model is much more complex than in clas- 
sical nonparametric regression. Proposed estimates require application of 
iterative algorithms and the estimates are not given as local weighted sums 
of independent observations as in classical nonparametric regression. This 
complicates the asymptotic analysis of the estimate. In this paper we
discuss practical implementations for the smooth backfitting algorithm.
Smooth backfitting was introduced in [9]. In particular, we will discuss
data-adaptive bandwidth selectors for this estimate. We will present
asymptotic results for the bandwidth selectors. Our main technical tools are
uniform expansions of the smooth backfitting estimate of order
$o_p(n^{-1/2})$ that allow us to carry over results from classical
nonparametric regression.

There have been three main proposals for fitting additive models: the or- 
dinary backfitting procedure of Buja, Hastie and Tibshirani [1], the marginal 
integration technique of Linton and Nielsen [8] and the smooth backfitting of 
Mammen, Linton and Nielsen [9]. Some asymptotic statistical properties of
the ordinary backfitting have been provided by Opsomer and Ruppert [13] 
and Opsomer [12]. Ordinary backfitting is not oracle efficient, that is, the 
estimates of the additive components do not have the same asymptotic prop- 
erties as if the other components were known. The marginal integration 
estimate is based on marginal integration of a full dimensional regression 
estimate. The statistical analysis of marginal integration is much simpler. 
In [8] it is shown for an additive model with two additive components that 
marginal integration achieves the one-dimensional $n^{-2/5}$ rate of convergence
under the smoothness condition that the component functions have two con- 
tinuous derivatives. However, marginal integration does not produce rate- 
optimal estimates unless smoothness of the regression function increases 
with the number of additive components. The smooth backfitting method 
does not have these drawbacks. It is rate-optimal and its implementation 
based on local linear estimation achieves the same bias and variance as the 
oracle estimator, that is, the theoretical estimate that is based on knowing 
other components. It employs a projection interpretation of popular kernel 
estimators provided by Mammen, Marron, Turlach and Wand [10], and it is 
based on iterative calculations of fits to the additive components. A short 
description of smooth backfitting will be given in the next two sections. This 
will be done for Nadaraya-Watson kernel smoothing and for local linear fits. 

For one-dimensional response variables $Y^1,\ldots,Y^n$ and $d$-dimensional
covariates $X^i=(X_1^i,\ldots,X_d^i)$ ($i=1,\ldots,n$) the additive regression
model is defined as

$$Y^i = m_0 + \sum_{j=1}^d m_j(X_j^i) + \varepsilon^i, \tag{1.1}$$

where $X^i=(X_1^i,\ldots,X_d^i)$ are random design points in $\mathbb{R}^d$,
$\varepsilon^i$ are unobserved error variables, $m_1,\ldots,m_d$ are functions
from $\mathbb{R}$ to $\mathbb{R}$ and $m_0$ is a constant. Throughout the paper
we will make the assumption that the tuples $(X^i,\varepsilon^i)$ are i.i.d.
and that the error variables $\varepsilon^i$ have conditional mean zero (given
the covariates $X^i$). Furthermore, it is assumed that $E\,m_j(X_j^i)=0$ for
$j=1,\ldots,d$ and that $\sum_{j=1}^d f_j(X_j^i)=0$ a.s. implies $f_j\equiv 0$
for all $j$. Then the functions $m_j$ are uniquely identifiable. The latter
assumption is a sufficient condition to avoid concurvity as termed by Hastie
and Tibshirani [5].

Our main results are higher-order stochastic expansions for the residual 
sums of squares of the smooth backfitting estimates. These results motivate 
the definition of a penalized sum of squared residuals. The bandwidth that 
minimizes the penalized sum will be called penalized least squares bandwidth. 
We will compare the penalized sum of squares with the average weighted 
squared error (ASE) 

$$ASE = n^{-1}\sum_{i=1}^n w(X^i)\Big\{\hat m_0+\sum_{j=1}^d \hat m_j(X_j^i)-m_0-\sum_{j=1}^d m_j(X_j^i)\Big\}^2. \tag{1.2}$$

Here w is a weight function. We will show that up to an additive term 
which is independent of the bandwidth the average weighted squared er- 
ror is asymptotically equivalent to the penalized sum of squared residuals. 
This implies that the penalized least squares bandwidth is asymptotically 
optimal. The results for Nadaraya-Watson smoothing are given in the next
section. Local linear smoothing will be discussed in Section 3. 

In addition to the penalized least squares bandwidth choice, we discuss
two plug-in selectors. The first of these is based on a first-order
expansion of ASE given in (1.2). This error criterion measures accuracy of
the sum of the additive components. An alternative error criterion measures
the accuracy of each single additive component,

$$ASE_j = n^{-1}\sum_{i=1}^n w_j(X_j^i)\{\hat m_j(X_j^i)-m_j(X_j^i)\}^2.$$

Here $w_j$ is a weight function. Use of $ASE_j$ instead of ASE may be
motivated by a more data-analytic focus of the statistical analysis.
Additionally, a more technical advantage holds for local linear smoothing.
The first-order expansion of $ASE_j$ only depends on the corresponding single
bandwidth and does not involve the bandwidths of the other components. In
particular, the plug-in bandwidth selector based on the approximation of
$ASE_j$ can be written down explicitly. For Nadaraya-Watson backfitting
estimates the bias of a single additive component depends on the whole vector
of bandwidths. Therefore an asymptotic expansion of $ASE_j$ involves the
bandwidths of all components. Also for the global error criterion ASE,
implementation of plug-in rules for Nadaraya-Watson smoothing is much more
complex. The bias part in the expansion of ASE for Nadaraya-Watson smoothing
has terms related to the multivariate design density, a well-known fact also
in the single smoother case, and the bias expression may not even be
expressed in a closed form. For these reasons, our discussion on plug-in
bandwidths will be restricted to local linear fits.

In classical nonparametric regression, the penalized sum of squared
residuals which we introduce in this paper is asymptotically equivalent to
cross-validation [4]. We conjecture that the same holds for additive models.
The approach based on the penalized sum of squared residuals is
computationally more feasible than cross-validation. It only requires one
$n$th of the computing time that is needed for the latter. In the numerical
study presented in Section 5, we found that the penalized least squares
bandwidth is a good approximation of the stochastic ASE-minimizer. It turned
out that it outperforms the two plug-in bandwidths by producing the least
ASE, while for accuracy of each one-dimensional component estimator, that is,
in terms of $ASE_j$, none of the bandwidth selectors dominates the others in
all cases.
In general, plug-in bandwidth selection requires estimation of additional 
functionals of the regression function (and of the design density). For this 
estimation one needs to select other tuning constants or bandwidths. Quan- 
tification of the optimal secondary tuning constant needs further asymptotic 
analysis and it would require more smoothness assumptions on the regres- 
sion and density functions. See [15], [16] and [19]. In this paper, we do not 
pursue this issue for the plug-in selectors. We only consider a simple choice 
of the auxiliary bandwidth. 

In this paper we do not address bandwidth choice under model misspec- 
ification. For additive models this is an important issue because in many 
applications the additive model will only be assumed to be a good approx- 
imation for the true model. We conjecture that the penalized least squares 
bandwidth will work reliably also under misspecification of the additive 
model. This conjecture is supported by the definition of this bandwidth. 
Performance of the plug-in rules has to be carefully checked because in their 
definitions they make use of the validity of the additive model. 

There have been many proposals for bandwidth selection in density and 
regression estimation with single smoothers. See [17] and [7] for kernel den- 
sity estimation, and [6] for kernel regression estimation. For additive models 
there have been only a few attempts for bandwidth selection. These in- 
clude [14] where a plug-in bandwidth selector is proposed for the ordinary 
backfitting procedure, [21] where generalized cross-validation is applied to 
penalized regression splines and [11] where cross-validation is discussed for 
smooth backfitting. 

In this paper we discuss smooth backfitting for Nadaraya-Watson smoothing
and for local linear smoothing. For practical implementations we definitely
recommend application of local linear smoothing. Local linear smooth
backfitting achieves oracle bounds. The asymptotic bias and variance of the
estimate of an additive component do not depend on the number and shape of
the other components. They are the same as in a classical regression model
with one component. This does not hold for Nadaraya-Watson smoothing.
Nevertheless in this paper we have included the discussion of
Nadaraya-Watson smoothing. This has been done mainly for clarity of
exposition of ideas and proofs. Smooth backfitting with local linear
smoothing requires a much more involved notation. This complicates the
mathematical discussions. For this reason we will give detailed proofs only
for Nadaraya-Watson smoothing. Ideas of the proofs carry over to local linear
smoothing. In Section 2 we start with Nadaraya-Watson smoothing. Smooth
backfitting for local linear smoothing is treated in Section 3. Practical
implementations of our bandwidth selectors are discussed in Section 4. In
Section 5 simulation results are presented for the performance of the
discussed bandwidth selectors. Section 6 states the assumptions and contains
the proofs of the theoretical results.



2. Smooth backfitting with Nadaraya-Watson smoothing. We now define the
smooth Nadaraya-Watson backfitting estimates. The estimate of the component
function $m_j$ in (1.1) is denoted by $\hat m_j^{NW}$. We suppose that the
covariates $X_j$ take values in a bounded interval $I_j$. The backfitting
estimates are defined as the minimizers of the following smoothed sum of
squares:

$$\sum_{i=1}^n\int_I\Big\{Y^i-\hat m_0^{NW}-\sum_{j=1}^d\hat m_j^{NW}(u_j)\Big\}^2K_h(u,X^i)\,du. \tag{2.1}$$

The minimization is done under the constraints

$$\int_{I_j}\hat m_j^{NW}(u_j)\,\hat p_j(u_j)\,du_j=0,\qquad j=1,\ldots,d. \tag{2.2}$$

Here, $I=I_1\times\cdots\times I_d$ and
$K_h(u,x^i)=K_{h_1}(u_1,x_1^i)\cdots K_{h_d}(u_d,x_d^i)$ is a $d$-dimensional
product kernel with factors $K_{h_j}(u_j,v_j)$ that satisfy for all
$v_j\in I_j$

$$\int_{I_j}K_{h_j}(u_j,v_j)\,du_j=1. \tag{2.3}$$

The kernel $K_{h_j}$ may depend also on $j$. This is suppressed in the
notation. In (2.2) $\hat p_j$ denotes the kernel density estimate of the
density $p_j$ of $X_j$,

$$\hat p_j(u_j)=n^{-1}\sum_{i=1}^nK_{h_j}(u_j,X_j^i). \tag{2.4}$$

The usual choice for $K_{h_j}(u_j,v_j)$ with (2.3) is given by

$$K_{h_j}(u_j,v_j)=\frac{K[h_j^{-1}(v_j-u_j)]}{\int_{I_j}K[h_j^{-1}(v_j-w_j)]\,dw_j}. \tag{2.5}$$

Note that for $u_j$, $v_j$ in the interior of $I_j$ we have

$$K_{h_j}(u_j,v_j)=h_j^{-1}K[h_j^{-1}(v_j-u_j)]$$

when $K$ integrates to 1 on its support.
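
The normalization (2.3) and the interior identity just noted can be checked numerically. The following is a minimal sketch, not part of the original development: it assumes the biweight kernel (the kernel used later in the simulations), the interval $[0,1]$, and a midpoint rule for the integrals. The function names `biweight`, `midpoint_integral` and `boundary_kernel` are hypothetical.

```python
def biweight(t):
    """Biweight kernel with support [-1, 1]; it integrates to 1."""
    return 15.0 / 16.0 * (1.0 - t * t) ** 2 if abs(t) <= 1.0 else 0.0


def midpoint_integral(f, a, b, steps=400):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))


def boundary_kernel(v, h, lo=0.0, hi=1.0):
    """Return the function u -> K_h(u, v) of (2.5) on the interval [lo, hi].

    The denominator rescales the kernel so that it integrates to 1 over u,
    which is condition (2.3). The integral is approximated numerically
    (an assumption of this sketch).
    """
    denom = midpoint_integral(lambda w: biweight((v - w) / h), lo, hi)
    return lambda u: biweight((v - u) / h) / denom


# Near the boundary the kernel is rescaled; in the interior it reduces to
# h^{-1} K[h^{-1}(v - u)].
k_edge = boundary_kernel(0.05, 0.2)
k_mid = boundary_kernel(0.5, 0.2)
print(midpoint_integral(k_edge, 0.0, 1.0), k_mid(0.5), biweight(0.0) / 0.2)
```

The rescaling only matters for $v_j$ within distance $h_j$ of the boundary of $I_j$; elsewhere the denominator equals $h_j$ up to numerical error.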

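By differentiation one can show that a minimizer of (2.1) satisfies for
$j=1,\ldots,d$ and $u_j\in I_j$

$$\sum_{i=1}^n\int_{I_{-j}}\Big\{Y^i-\hat m_0^{NW}-\sum_{k=1}^d\hat m_k^{NW}(u_k)\Big\}K_h(u,X^i)\,du_{-j}=0$$

and thus

$$\sum_{i=1}^n\int_{I}\Big\{Y^i-\hat m_0^{NW}-\sum_{k=1}^d\hat m_k^{NW}(u_k)\Big\}K_h(u,X^i)\,du=0,$$

where $I_{-j}=I_1\times\cdots\times I_{j-1}\times I_{j+1}\times\cdots\times I_d$
and $u_{-j}=(u_1,\ldots,u_{j-1},u_{j+1},\ldots,u_d)$. Now, because of (2.3) we
can rewrite these equations as

$$\hat m_j^{NW}(u_j)=\tilde m_j^{NW}(u_j)-\sum_{k\ne j}\int_{I_k}\hat m_k^{NW}(u_k)\frac{\hat p_{jk}(u_j,u_k)}{\hat p_j(u_j)}\,du_k-\hat m_0^{NW}, \tag{2.6}$$

$$\hat m_0^{NW}=n^{-1}\sum_{i=1}^nY^i, \tag{2.7}$$

where $\hat p_{jk}(u_j,u_k)=n^{-1}\sum_{i=1}^nK_{h_j}(u_j,X_j^i)K_{h_k}(u_k,X_k^i)$
is a two-dimensional kernel density estimate of the marginal density $p_{jk}$
of $(X_j,X_k)$. Furthermore, $\tilde m_j^{NW}(u_j)$ denotes the
Nadaraya-Watson estimate

$$\tilde m_j^{NW}(u_j)=\hat p_j(u_j)^{-1}\,n^{-1}\sum_{i=1}^nK_{h_j}(u_j,X_j^i)\,Y^i.$$

In case one does not use kernels that satisfy (2.3), equations (2.6) and (2.7)
have to be replaced by slightly more complicated equations; see [9] for details.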
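
The updating equations (2.6) and (2.7) can be sketched on a grid as follows. This is an illustration under stated assumptions rather than a definitive implementation: covariates are taken to lie in $[0,1]$, integrals are replaced by Riemann sums over grid midpoints, the kernel is the biweight kernel, and the kernel weights are renormalized on the grid so that a discrete version of (2.3) holds. The function names are hypothetical, and no guard is included for grid points at which the density estimate vanishes.

```python
import random


def biweight(t):
    """Biweight kernel with support [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - t * t) ** 2 if abs(t) <= 1.0 else 0.0


def kernel_weights(xi, grid, h, du):
    """Discrete K_h(u_g, xi), rescaled so that sum_g K * du = 1, cf. (2.3)."""
    raw = [biweight((xi - u) / h) for u in grid]
    total = sum(raw) * du
    return [r / total for r in raw]


def smooth_backfit_nw(X, Y, h, grid_size=25, sweeps=30):
    """Iterate (2.6) on a grid; returns (grid, m0, m, p) with m[j][g], p[j][g]."""
    n, d = len(X), len(X[0])
    du = 1.0 / grid_size
    grid = [(g + 0.5) * du for g in range(grid_size)]
    G = range(grid_size)
    # K[j][i][g] approximates K_{h_j}(u_g, X_j^i).
    K = [[kernel_weights(X[i][j], grid, h[j], du) for i in range(n)]
         for j in range(d)]
    m0 = sum(Y) / n                                            # (2.7)
    # One-dimensional density estimates, (2.4).
    p = [[sum(K[j][i][g] for i in range(n)) / n for g in G] for j in range(d)]
    # Marginal Nadaraya-Watson estimates.
    m_marg = [[sum(K[j][i][g] * Y[i] for i in range(n)) / (n * p[j][g])
               for g in G] for j in range(d)]
    # Two-dimensional density estimates p_jk on the product grid.
    p2 = {}
    for j in range(d):
        for k in range(d):
            if j != k:
                p2[j, k] = [[sum(K[j][i][a] * K[k][i][b] for i in range(n)) / n
                             for b in G] for a in G]
    m = [[0.0] * grid_size for _ in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            for a in G:
                proj = sum(m[k][b] * p2[j, k][a][b]
                           for k in range(d) if k != j for b in G) * du
                m[j][a] = m_marg[j][a] - proj / p[j][a] - m0   # (2.6)
    return grid, m0, m, p


rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(150)]
Y = [x[0] ** 2 + x[1] ** 3 + rng.gauss(0.0, 0.1) for x in X]
grid, m0, m, p = smooth_backfit_nw(X, Y, [0.2, 0.2])
```

With the discrete renormalization, each update leaves the grid version of the constraint (2.2) satisfied up to rounding error, so no separate centering step is needed in this sketch. A fixed number of sweeps is used here for simplicity; a stopping rule based on the change between sweeps would be a natural alternative.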

Suppose now that $I_j=[0,1]$, and define for a weight function $w$ and a
constant $C_H'>0$

$$RSS(h)=n^{-1}\sum_{i=1}^n 1(C_H'n^{-1/5}\le X_j^i\le 1-C_H'n^{-1/5}\ \text{for } 1\le j\le d)\, w(X^i)\{Y^i-\hat m_0^{NW}-\hat m_1^{NW}(X_1^i)-\cdots-\hat m_d^{NW}(X_d^i)\}^2, \tag{2.8}$$

$$ASE(h)=n^{-1}\sum_{i=1}^n 1(C_H'n^{-1/5}\le X_j^i\le 1-C_H'n^{-1/5}\ \text{for } 1\le j\le d)\, w(X^i)\{\hat m_0^{NW}+\hat m_1^{NW}(X_1^i)+\cdots+\hat m_d^{NW}(X_d^i)-m_0-m_1(X_1^i)-\cdots-m_d(X_d^i)\}^2, \tag{2.9}$$

where $1(A)$ denotes the indicator which equals 1 if $A$ occurs and 0
otherwise. The indicator function has been included in (2.8) and (2.9) to
exclude boundary regions of the design where the Nadaraya-Watson smoother has
bias terms of order $n^{-1/5}$. In the following Theorems 2.1 and 2.2 we will
consider bandwidths $h_j$ that are smaller than $C_H'n^{-1/5}$. Because we
assume that the kernel $K$ has support $[-1,1]$ [see assumption (A1) in
Section 6.1], boundary regions with higher-order bias terms are then excluded.
We now state our first main result. The assumptions can be found in Section 6.

Theorem 2.1. Suppose that assumptions (A1)-(A4) apply for model (1.1) and
that $\hat m_j^{NW}$ are defined according to (2.1) and (2.2). Assume that
$I_j$ are bounded intervals ($I_j=[0,1]$ w.l.o.g.) and that
$K_{h_j}(u_j,v_j)$ are kernels that satisfy
$K_{h_j}(u_j,v_j)=h_j^{-1}K[h_j^{-1}(v_j-u_j)]$ for $h_j\le v_j\le 1-h_j$ for a
function $K$ and $K_{h_j}(u_j,v_j)=0$ for $|v_j-u_j|>h_j$. Then with $C_H'$ as
in (2.8) and (2.9) and for all constants $C_H<C_H'$, we have uniformly for
$C_Hn^{-1/5}\le h_j\le C_H'n^{-1/5}$

$$RSS(h)-n^{-1}\sum_{i=1}^n 1(C_H'n^{-1/5}\le X_j^i\le 1-C_H'n^{-1/5}\ \text{for } 1\le j\le d)\,w(X^i)(\varepsilon^i)^2+2n^{-1}\Big\{\sum_{i=1}^nw(X^i)(\varepsilon^i)^2\Big\}K(0)\sum_{j=1}^d\frac{1}{nh_j}-ASE(h)=o_p(n^{-4/5}). \tag{2.10}$$

Furthermore, for fixed sequences $h$ with $C_Hn^{-1/5}\le h_j\le C_H'n^{-1/5}$,
this difference is of order $O_p(n^{-9/10})$.

To state the second main result, let $\beta_j(h,u_j)$, $j=1,\ldots,d$, denote
minimizers of
$\int\{\beta(h,u)-\beta_1(h,u_1)-\cdots-\beta_d(h,u_d)\}^2p(u)\,du$, where




The functions $\beta_j(h,u_j)$, $j=1,\ldots,d$, are uniquely defined only up
to an additive constant. However, their sum is uniquely defined. Define

$$PLS(h)=RSS(h)\Big\{1+2\sum_{j=1}^d\frac{1}{nh_j}K(0)\Big\}.$$

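Theorem 2.2. Under the assumptions of Theorem 2.1, we have uniformly for
$C_Hn^{-1/5}\le h_j\le C_H'n^{-1/5}$,

$$ASE(h)=\frac1n\sum_{i=1}^n 1(C_H'n^{-1/5}\le X_j^i\le 1-C_H'n^{-1/5}\ \text{for } 1\le j\le d)\,w(X^i)(\varepsilon^i)^2\int K^2(t)\,dt\sum_{j=1}^d\frac{1}{nh_j}+\int\{\beta_1(h,u_1)+\cdots+\beta_d(h,u_d)\}^2w(u)p(u)\,du+o_p(n^{-4/5}), \tag{2.11a}$$

$$PLS(h)-ASE(h)=\frac1n\sum_{i=1}^n 1(C_H'n^{-1/5}\le X_j^i\le 1-C_H'n^{-1/5}\ \text{for } 1\le j\le d)\,w(X^i)(\varepsilon^i)^2+o_p(n^{-4/5}). \tag{2.11b}$$
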
Now we define

$$\hat h_{PLS}=\arg\min PLS(h),\qquad \hat h_{ASE}=\arg\min ASE(h).$$

Here and throughout the paper, the "argmin" runs over $h$ with
$C_Hn^{-1/5}\le h_j\le C_H'n^{-1/5}$. It would be a more useful result to have
some theory for a bandwidth selector that estimates the optimal bandwidth
over a range of rates, for example, $h_j\in[An^{-a},Bn^{-b}]$ for some
prespecified positive constants $a,b,A,B$. This would involve uniform
expansions of $RSS(h)$ and $ASE(h)$ over the extended range of the bandwidth,
which undoubtedly makes the derivations much more complicated. Thus, it is
avoided in this paper.

The following corollary is an immediate consequence of Theorem 2.2. 

Corollary 2.3. Under the conditions of Theorem 2.1,

$$\hat h_{PLS}-\hat h_{ASE}=o_p(n^{-1/5}).$$

We conjecture that $(\hat h_{PLS}-\hat h_{ASE})/\hat h_{ASE}=O_p(n^{-1/10})$.
This is suggested by the fact that for fixed $h$ in Theorem 2.2 the error
term $o_p(n^{-4/5})$ can be replaced by $O_p(n^{-9/10})$.

3. Smooth backfitting using local linear fits. The smooth backfitting local
linear estimates are defined as minimizers of

$$\sum_{i=1}^n\int_I\Big\{Y^i-\hat m_0^{LL}-\sum_{j=1}^d\hat m_j^{LL}(u_j)-\sum_{j=1}^d\hat m_j^{LL,1}(u_j)(X_j^i-u_j)\Big\}^2K_h(u,X^i)\,du. \tag{3.1}$$

Here $\hat m_j^{LL}$ is an estimate of $m_j$ and $\hat m_j^{LL,1}$ is an
estimate of its derivative.

By using slightly more complicated arguments than those in Section 6 one can
show that $\hat m_0^{LL},\ldots,\hat m_d^{LL,1}$ satisfy the equations




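Here and below,

$$\hat S_{lj}(u_l,u_j)=n^{-1}\sum_{i=1}^nK_{h_l}(u_l,X_l^i)K_{h_j}(u_j,X_j^i)\begin{pmatrix}1&X_l^i-u_l\\ X_j^i-u_j&(X_l^i-u_l)(X_j^i-u_j)\end{pmatrix}$$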

n 
i=l 

and $\hat p_j$ is defined as in the last section. For each $j$, the estimates
$\tilde m_j^{LL}$ and $\tilde m_j^{LL,1}$ are the local linear fits obtained
by regression of $Y^i$ onto $X_j^i$; that is, these quantities minimize

$$\sum_{i=1}^n\{Y^i-\tilde m_j^{LL}(u_j)-\tilde m_j^{LL,1}(u_j)(X_j^i-u_j)\}^2K_{h_j}(u_j,X_j^i).$$

A detailed discussion on why (3.1) is equivalent to (3.2) and (3.3) can be
found in [9], where a slightly different notation was used. The definition
of $\hat m_0^{LL},\ldots,\hat m_d^{LL,1}$ can be made unique by imposing the
additional norming conditions




The smooth backfitting estimates can be calculated by iterative application
of (3.2). In each application the current versions of $\hat m_l^{LL}$,
$\hat m_l^{LL,1}$ ($l\ne j$) are plugged into the right-hand side of (3.2) and
are used to update $\hat m_j^{LL}$, $\hat m_j^{LL,1}$. The iteration converges
with geometric rate (see [9]). The number of iterations may be determined by
a standard error criterion. After the last iteration, a norming constant can
be subtracted from the last fit of $\hat m_j^{LL}$ so that (3.4) holds.
Because of (3.4) this yields $\hat m_0^{LL}=n^{-1}\sum_{i=1}^nY^i$.

We now define the residual sum of squares $RSS(h)$ and the average squared
error. This is done similarly as in (2.8) and (2.9). But now the sums run
over the full intervals $I_j$. This differs from Nadaraya-Watson smoothing
where the summation excludes boundary values. For Nadaraya-Watson smoothing
the boundary values are removed because of bias problems. Let

$$RSS(h)=n^{-1}\sum_{i=1}^nw(X^i)\{Y^i-\hat m_0^{LL}-\hat m_1^{LL}(X_1^i)-\cdots-\hat m_d^{LL}(X_d^i)\}^2, \tag{3.5}$$

$$ASE(h)=n^{-1}\sum_{i=1}^nw(X^i)\{\hat m_0^{LL}+\hat m_1^{LL}(X_1^i)+\cdots+\hat m_d^{LL}(X_d^i)-m_0-m_1(X_1^i)-\cdots-m_d(X_d^i)\}^2. \tag{3.6}$$

As in Section 6 we define the penalized sum of squared residuals

$$PLS(h)=RSS(h)\Big\{1+2\sum_{j=1}^d\frac{1}{nh_j}K(0)\Big\}.$$

The penalized least squares bandwidth $\hat h_{PLS}$ is again given by

$$\hat h_{PLS}=\arg\min PLS(h).$$
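
The penalty is cheap to evaluate once $RSS(h)$ is available. The following minimal sketch shows the computation and a search over a finite candidate set. The names `pls`, `select_pls_bandwidth` and `rss_fn` are hypothetical; `rss_fn` stands for whatever routine fits the smooth backfitting estimate at bandwidth vector `h` and returns $RSS(h)$, and the toy RSS used in the example is purely illustrative.

```python
def pls(rss, h, n, k0):
    """Penalized sum of squared residuals: RSS(h) * {1 + 2 * sum_j K(0) / (n h_j)}."""
    return rss * (1.0 + 2.0 * sum(k0 / (n * hj) for hj in h))


def select_pls_bandwidth(rss_fn, candidates, n, k0):
    """Return the candidate bandwidth vector with the smallest PLS value."""
    return min(candidates, key=lambda h: pls(rss_fn(h), h, n, k0))


# Toy illustration with d = 1: RSS grows with h (bias), the penalty shrinks.
toy_rss = lambda h: 1.0 + 50.0 * h[0] ** 4
best = select_pls_bandwidth(toy_rss, [(0.1,), (0.2,), (0.4,)], n=100, k0=15.0 / 16.0)
```

Here `k0` is the kernel value $K(0)$, which equals $15/16$ for the biweight kernel. The penalty factor inflates the residual sum of squares more strongly for small bandwidths, which is what counteracts overfitting.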

Define

$$\beta_j(u_j)=\tfrac12\,m_j''(u_j)\int t^2K(t)\,dt.$$



Analogous to Theorems 2.1, 2.2 and Corollary 2.3, we now get the following 
results for local linear smoothing. 



Theorem 3.1. Suppose that assumptions (A1)-(A4) apply, that $I_j=[0,1]$ and
that $\hat m_j^{LL}$ is defined according to (3.1) and (3.4) with kernels
$K_{h_j}(u_j,v_j)$. The kernels are supposed to satisfy the conditions of
Theorem 2.1. Then, uniformly for $C_Hn^{-1/5}\le h_j\le C_H'n^{-1/5}$,

$$RSS(h)-n^{-1}\sum_{i=1}^nw(X^i)(\varepsilon^i)^2+2n^{-1}\Big\{\sum_{i=1}^nw(X^i)(\varepsilon^i)^2\Big\}K(0)\sum_{j=1}^d\frac{1}{nh_j}-ASE(h)=o_p(n^{-4/5}), \tag{3.7}$$

$$ASE(h)=\Big\{\int w(u)p(u)E[(\varepsilon^i)^2\mid X^i=u]\,du\Big\}\int K^2(t)\,dt\sum_{j=1}^d\frac{1}{nh_j}+\int\Big\{\sum_{j=1}^dh_j^2\beta_j(u_j)\Big\}^2w(u)p(u)\,du+o_p(n^{-4/5}), \tag{3.8}$$

$$PLS(h)-ASE(h)=\frac1n\sum_{i=1}^nw(X^i)(\varepsilon^i)^2+o_p(n^{-4/5}), \tag{3.9}$$

$$\hat h_{PLS}-\hat h_{ASE}=o_p(n^{-1/5}). \tag{3.10}$$

For fixed sequences $h$ with $C_Hn^{-1/5}\le h_j\le C_H'n^{-1/5}$, the
expansions in (3.7)-(3.9) hold up to order $O_p(n^{-9/10})$.

If the errors of the expansions in (3.7)-(3.9) were of order
$O_p(n^{-9/10})$, uniformly in $h$, this would imply
$(\hat h_{PLS}-\hat h_{ASE})/\hat h_{ASE}=O_p(n^{-1/10})$.

Next we consider plug-in bandwidth selectors. As for penalized least
squares, plug-in bandwidth selectors may be constructed that approximately
minimize $ASE(h)$. Let $AASE(h)$ denote the nonstochastic first-order
expansion of $ASE(h)$, given in (3.8). Define

$$\widehat{AASE}(h)=\frac1n\sum_{i=1}^nw(X^i)(\hat\varepsilon^i)^2\Big\{\int K^2(t)\,dt\Big\}\sum_{j=1}^d\frac{1}{nh_j}+\frac{1}{4n}\sum_{i=1}^nw(X^i)\Big\{\sum_{j=1}^d\hat m_j''(X_j^i)h_j^2\Big\}^2\Big\{\int t^2K(t)\,dt\Big\}^2.$$

Here $\hat m_j''$ is an estimate of $m_j''$ and
$\hat\varepsilon^i=Y^i-\hat m(X^i)$ are residuals based on an estimate
$\hat m(x)$ of $m_0+m_1(x_1)+\cdots+m_d(x_d)$. Choices of $\hat m_j''$ and
$\hat m$ will be discussed below. A plug-in bandwidth
$\hat h_{PL}=(\hat h_{PL,1},\ldots,\hat h_{PL,d})$ is defined by

$$\hat h_{PL}=\arg\min\widehat{AASE}(h). \tag{3.11}$$

The plug-in bandwidth $\hat h_{PL}$ will be compared with the theoretically
optimal bandwidth $h_{opt}$,

$$h_{opt}=\arg\min AASE(h). \tag{3.12}$$

There is an alternative way of plug-in bandwidth selection for another error
criterion. It is based on an error criterion that measures accuracy of each
one-dimensional additive component separately. Let

$$ASE_j(h)=n^{-1}\sum_{i=1}^nw_j(X_j^i)\{\hat m_j^{LL}(X_j^i)-m_j(X_j^i)\}^2, \tag{3.13}$$

where $w_j$ is a smooth weight function. It may be argued that $ASE_j$ is more
appropriate if the focus is a more data-analytic interpretation of the data,
whereas use of ASE may be more appropriate for finding good prediction
rules. Our next result shows that in first order $ASE_j(h)$ only depends on
$h_j$. This motivates a simple plug-in bandwidth selection rule. An analogous
result does not hold for Nadaraya-Watson smoothing.

Theorem 3.2. Under the assumptions of Theorem 3.1, it holds that,
uniformly for $h$ with $C_Hn^{-1/5}\le h_l\le C_H'n^{-1/5}$ ($1\le l\le d$),

$$ASE_j(h)=\frac{1}{nh_j}\Big\{\int w_j(u_j)p_j(u_j)E[(\varepsilon^i)^2\mid X_j^i=u_j]\,du_j\Big\}\int K^2(t)\,dt+\frac14h_j^4\int m_j''(u_j)^2w_j(u_j)p_j(u_j)\,du_j\Big\{\int t^2K(t)\,dt\Big\}^2+o_p(n^{-4/5}). \tag{3.14}$$

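The first-order expansion of $ASE_j(h)$ in (3.14) is minimized by

$$h^*_{opt,j}=n^{-1/5}\Big[\Big\{\int w_j(u_j)p_j(u_j)E[(\varepsilon^i)^2\mid X_j^i=u_j]\,du_j\Big\}\int K^2(t)\,dt\Big]^{1/5}\Big[\int m_j''(u_j)^2w_j(u_j)p_j(u_j)\,du_j\Big\{\int t^2K(t)\,dt\Big\}^2\Big]^{-1/5}.$$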



We note that $h_{opt}$ defined in (3.12) is different from
$h^*_{opt}=(h^*_{opt,1},\ldots,h^*_{opt,d})$. Now this bandwidth can be
estimated by

$$\hat h^*_{PL,j}=n^{-1/5}\Big[n^{-1}\sum_{i=1}^nw_j(X_j^i)(\hat\varepsilon^i)^2\int K^2(t)\,dt\Big]^{1/5}\Big[n^{-1}\sum_{i=1}^nw_j(X_j^i)\hat m_j''(X_j^i)^2\Big\{\int t^2K(t)\,dt\Big\}^2\Big]^{-1/5} \tag{3.15}$$

with an estimate $\hat m_j''$ of $m_j''$ and residuals
$\hat\varepsilon^i=Y^i-\hat m(X^i)$ based on an estimate $\hat m(x)$ of
$m_0+m_1(x_1)+\cdots+m_d(x_d)$. Contrary to $\hat h_{PL}$, approximation of
the bandwidth selector $\hat h^*_{PL}$ does not require a grid search on a
high-dimensional bandwidth space or an iterative procedure with a
one-dimensional grid search. See the discussion at the end of Section 4.
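
The closed form in (3.15) can be sketched as below. The two empirical averages are passed in as numbers; the default kernel constants $\int K^2(t)\,dt=5/7$ and $\int t^2K(t)\,dt=1/7$ are those of the biweight kernel. The function names and the numerical inputs in the example are hypothetical and serve only as an illustration.

```python
def plugin_bandwidth(mean_w_resid_sq, mean_w_curv_sq, n,
                     int_k2=5.0 / 7.0, mu2=1.0 / 7.0):
    """Closed-form bandwidth of (3.15).

    mean_w_resid_sq: n^{-1} sum_i w_j(X_j^i) * (residual_i)^2
    mean_w_curv_sq:  n^{-1} sum_i w_j(X_j^i) * (estimated m_j''(X_j^i))^2
    """
    variance_part = mean_w_resid_sq * int_k2
    bias_part = mean_w_curv_sq * mu2 ** 2
    return n ** (-0.2) * variance_part ** 0.2 * bias_part ** (-0.2)


def first_order_ase(hj, mean_w_resid_sq, mean_w_curv_sq, n,
                    int_k2=5.0 / 7.0, mu2=1.0 / 7.0):
    """Empirical analogue of the leading terms in (3.14)."""
    return (mean_w_resid_sq * int_k2 / (n * hj)
            + 0.25 * hj ** 4 * mean_w_curv_sq * mu2 ** 2)


# Hypothetical inputs: error variance about 0.01, squared curvature about 4.
h_star = plugin_bandwidth(0.01, 4.0, n=200)
```

Setting the derivative of `first_order_ase` with respect to `hj` to zero gives $h_j^5 = A/(nB)$ with $A$ the variance part and $B$ the bias part, which is the expression returned by `plugin_bandwidth`.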

Now we present a procedure for estimating $m_j''$, which is required to
implement $\hat h_{PL}$ and $\hat h^*_{PL}$. A simple estimate of $m_j''$ may
be given by smoothed differentiation of $\hat m_j^{LL,1}$. However, a
numerical study for this estimate revealed that it suffers from serious
boundary effects. We propose to use an alternative estimate which is based on
a local quadratic fit. It is defined by

$$\hat m_j''(u_j)=2\hat\beta_{j,2}(u_j), \tag{3.16}$$

where $\hat\beta_{j,2}(u_j)$ along with $\hat\beta_{j,0}(u_j)$ and
$\hat\beta_{j,1}(u_j)$ minimizes

$$\int\{\hat m_j^{LL}(v_j)-\hat\beta_{j,0}(u_j)-\hat\beta_{j,1}(u_j)(v_j-u_j)-\hat\beta_{j,2}(u_j)(v_j-u_j)^2\}^2L[g_j^{-1}(v_j-u_j)]\,dv_j.$$
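
A minimal sketch of the local quadratic step in (3.16) follows. It assumes that the fitted component is available on a grid, that the integral is replaced by a sum over grid points, and that $L$ is the biweight kernel; the names `solve3` and `second_derivative` are hypothetical. At least three grid points must lie within distance $g_j$ of the evaluation point for the fit to be defined.

```python
def biweight(t):
    """Biweight kernel with support [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - t * t) ** 2 if abs(t) <= 1.0 else 0.0


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for k in range(c, 4):
                m[r][k] -= f * m[c][k]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][k] * x[k] for k in range(r + 1, 3))) / m[r][r]
    return x


def second_derivative(grid, fitted, u, g):
    """Estimate m''(u) as 2 * beta_2 from a kernel-weighted quadratic fit."""
    s = [0.0] * 5          # sums of weight * t^p for p = 0..4
    r = [0.0] * 3          # sums of weight * t^p * fitted value for p = 0..2
    for v, y in zip(grid, fitted):
        t = v - u
        w = biweight(t / g)
        for p in range(5):
            s[p] += w * t ** p
        for p in range(3):
            r[p] += w * t ** p * y
    normal_matrix = [[s[i + k] for k in range(3)] for i in range(3)]
    beta = solve3(normal_matrix, r)
    return 2.0 * beta[2]


# A quadratic is reproduced exactly, also at the boundary point u = 0.
grid = [k / 24.0 for k in range(25)]
curvature_at_0 = second_derivative(grid, [v * v for v in grid], 0.0, 0.3)
```

Because a local quadratic fit reproduces quadratic functions exactly, the estimate has no additional boundary bias for such functions, which is consistent with the motivation given above for preferring it over smoothed differentiation.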

The definitions of $\hat h_{PL}$ and $\hat h^*_{PL}$ make use of fitted
residuals. But these residuals along with the local quadratic estimate of
$m_j''$ defined in (3.16) involve application of the backfitting regression
algorithm. For these pilot estimates one needs to select another set of
bandwidths. Iterative schemes to select fully data-dependent plug-in
bandwidths are discussed in Section 4.

The next theorem states the conditions under which $\hat m_j''$ is uniformly
consistent. This immediately implies that $\hat h_{PL}-h_{opt}$ and
$\hat h^*_{PL}-h^*_{opt}$ are of lower order.



Theorem 3.3. Suppose that assumption (A5), in addition to the assumptions of
Theorem 3.1, holds. Then, for $g_j$ with $g_j\to0$ and
$g_j^{-2}n^{-2/5}(\log n)^{1/2}\to0$, we have uniformly for $0\le u_j\le1$,

$$\hat m_j''(u_j)-m_j''(u_j)=o_p(1).$$

Suppose additionally that

$$\frac1n\sum_{i=1}^nw(X^i)\{\hat m(X^i)-m_0-m_1(X_1^i)-\cdots-m_d(X_d^i)\}^2=o_p(1).$$

Then

$$\hat h_{PL}-h_{opt}=o_p(n^{-1/5}).$$

If additionally

$$\frac1n\sum_{i=1}^nw_j(X_j^i)\{\hat m(X^i)-m_0-m_1(X_1^i)-\cdots-m_d(X_d^i)\}^2=o_p(1),$$

then

$$\hat h^*_{PL}-h^*_{opt}=o_p(n^{-1/5}).$$

We now give a heuristic discussion of the rates of convergence of
$(\hat h_{PL,j}-h_{opt,j})/h_{opt,j}$ and
$(\hat h^*_{PL,j}-h^*_{opt,j})/h^*_{opt,j}$. For simplicity we consider only
the latter. Similar arguments may be applied to the former. Note that the
rate of the latter coincides with that of

$$n^{-1}\sum_{i=1}^nw_j(X_j^i)\{\hat m_j''(X_j^i)^2-m_j''(X_j^i)^2\}.$$

We now suppose that $\hat m_j''(u_j)$ can be decomposed into
$m_j''(u_j)+\mathrm{bias}(u_j)+\mathrm{stoch}(u_j)$, where
$\mathrm{bias}(u_j)$ is a bias term and $\mathrm{stoch}(u_j)$ is a mean zero
part consisting of local and global averages of $\varepsilon^i$. Under
higher-order smoothness conditions one may expect an order of $g_j^2$ for
$\mathrm{bias}(u_j)$ and an order of $(ng_j^5)^{-1/2}$ for
$\mathrm{stoch}(u_j)$. Now

$$n^{-1}\sum_{i=1}^nw_j(X_j^i)\{\hat m_j''(X_j^i)^2-m_j''(X_j^i)^2\}
=n^{-1}\sum_{i=1}^nw_j(X_j^i)\,\mathrm{bias}(X_j^i)^2+2n^{-1}\sum_{i=1}^nw_j(X_j^i)\,\mathrm{bias}(X_j^i)\,\mathrm{stoch}(X_j^i)
+n^{-1}\sum_{i=1}^nw_j(X_j^i)\,\mathrm{stoch}(X_j^i)^2+2n^{-1}\sum_{i=1}^nw_j(X_j^i)m_j''(X_j^i)\,\mathrm{bias}(X_j^i)
+2n^{-1}\sum_{i=1}^nw_j(X_j^i)m_j''(X_j^i)\,\mathrm{stoch}(X_j^i).$$

By standard reasoning, one may find the following rates of convergence for
the five terms on the right-hand side of the above equation: $g_j^4$,
$n^{-1/2}g_j^2$, $n^{-1}g_j^{-5}$, $g_j^2$ and $n^{-1/2}$. The maximum of
these orders is minimized by $g_j\sim n^{-1/7}$, which balances $g_j^2$
against $n^{-1}g_j^{-5}$, leading to
$(\hat h^*_{PL,j}-h^*_{opt,j})/h^*_{opt,j}=O_p(n^{-2/7})$.

The relative rate $O_p(n^{-2/7})$ for the plug-in bandwidth selectors is also
achieved by the fully automated bandwidth selector of Opsomer and Ruppert
[14], and is identical to the rate of the plug-in rule for the
one-dimensional local linear regression estimator of Ruppert, Sheather and
Wand [18]. We note here that more sophisticated choices of the constant
factor of $n^{-1/7}$ for the bandwidth $g_j$ would yield faster rates such as
$n^{-4/13}$ or even $n^{-5/14}$. See [15, 16] or [19].

4. Practical implementation of the bandwidth selectors. We suggest use of
iterative procedures for approximation of $\hat h_{PLS}$, $\hat h_{PL}$ and
$\hat h^*_{PL}$. We note that use of $\hat h_{PL}$ and $\hat h^*_{PL}$ is
restricted to local linear smooth backfitting. For $\hat h_{PLS}$ we propose
use of the iterative smooth backfitting algorithm based on (2.6) for
Nadaraya-Watson smoothing and (3.2) for the local linear fit, and updating of
the bandwidth $h_j$ when the $j$th additive component is calculated in the
iteration step. This can be done by computing $PLS(h)$ for a finite number of
$h_j$'s with $h_1,\ldots,h_{j-1},h_{j+1},\ldots,h_d$ being held fixed, and
then by replacing $h_j$ by the minimizing value of $h_j$. Specifically, we
suggest the following procedure:

Step 0. Initialize $h_j^{[0]}$ for $j=1,\ldots,d$.

Step r. Find $h_j^{[r]}=\arg\min_{h_j}PLS(h_1^{[r]},\ldots,h_{j-1}^{[r]},h_j,h_{j+1}^{[r-1]},\ldots,h_d^{[r-1]})$
on a grid of $h_j$, for $j=1,\ldots,d$.

The computing time for the above iterative procedure to find $\hat h_{PLS}$
is $R\times d\times N\times C$, where $R$ denotes the number of iterations,
$N$ is the number of points on the grid of each $h_j$ and $C$ is the time for
the evaluation of PLS (or equivalently RSS) with a given set of bandwidths.
This is much less than the computing time required for the $d$-dimensional
grid search, which is $N^d\times C$.
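
Steps 0 and r above amount to a coordinate-wise grid search. The sketch below is generic: `criterion` stands for any function of the bandwidth vector (for example, a routine returning $PLS(h)$), and the stopping rule on relative changes with tolerance `tol` is an assumption of this illustration, as are all names.

```python
def coordinate_grid_search(criterion, h_init, grid, max_rounds=20, tol=1e-3):
    """Cycle through the coordinates, minimizing over one bandwidth at a time."""
    h = list(h_init)
    for _ in range(max_rounds):
        previous = h[:]
        for j in range(len(h)):
            # One-dimensional search in h_j with the other bandwidths held fixed;
            # coordinates before j already carry their updated values.
            h[j] = min(grid, key=lambda c: criterion(h[:j] + [c] + h[j + 1:]))
        change = max(abs(a - b) / b for a, b in zip(h, previous))
        if change < tol:
            break
    return h


# Toy criterion with a known minimizer on the grid.
grid = [0.05 * k for k in range(1, 11)]
target = [grid[2], grid[5]]
toy = lambda h: sum((a - b) ** 2 for a, b in zip(h, target))
h_hat = coordinate_grid_search(toy, [0.1, 0.1], grid)
```

Each round evaluates the criterion $d\times N$ times, so $R$ rounds cost $R\times d\times N$ evaluations, compared with $N^d$ for an exhaustive search over the product grid.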

In the implementation of the iterative smooth backfitting algorithm, the
estimate $\hat m_j$ could be calculated on a grid of $I_j$. The integrals
used in the updating steps of the smooth backfitting can be replaced by the
weighted averages over this grid. For the calculation of $PLS(h)$ we need
$\hat m_j(X_j^i)$. These values can be approximated by linear interpolation
between the neighboring points on the grid. In the simulation study presented
in the next section we used a grid of 25 equally spaced points in the
interval $I_j=[0,1]$.
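
The interpolation step can be sketched as follows. The function name `interpolate` is hypothetical, and holding the estimate constant outside the range of the grid is an assumption of this sketch.

```python
import bisect


def interpolate(grid, values, x):
    """Linear interpolation of grid values at x (constant beyond the end points)."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    lo = bisect.bisect_right(grid, x) - 1      # index of the grid point left of x
    weight = (x - grid[lo]) / (grid[lo + 1] - grid[lo])
    return (1.0 - weight) * values[lo] + weight * values[lo + 1]


# 25 equally spaced points on [0, 1], as in the simulation study.
grid = [k / 24.0 for k in range(25)]
values = [2.0 * v + 1.0 for v in grid]
fitted_at_x = interpolate(grid, values, 0.3)
```

Evaluating the fitted components at all design points this way costs a binary search per observation, which is negligible next to the backfitting iterations.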

Next we discuss how to approximate $\hat h_{PL}$ for the local linear smooth
backfitting. We calculate the residuals by use of a backfitting estimate.
This means that we replace $n^{-1}\sum_{i=1}^nw(X^i)(\hat\varepsilon^i)^2$ in
$\widehat{AASE}$ by RSS as defined in (3.5). Recall that RSS involves the
bandwidth $h=(h_1,\ldots,h_d)$, and that the local quadratic estimate
$\hat m_j''$ defined in (3.16) depends on the bandwidth $g_j$ as well as $h$.
The residual sum of squares RSS and the estimate $\hat m_j''$ depend on the
bandwidths $h$ of the smooth backfitting regression estimates. To stress this
dependence on $h$ and $g_j$, we write $RSS(h)$ and $\hat m_j''(\cdot;h,g_j)$
for RSS and $\hat m_j''$, respectively. We propose the following iterative
procedure for $\hat h_{PL}$:

Step 0. Initialize $h^{[0]}=(h_1^{[0]},\ldots,h_d^{[0]})$.

Step r. Compute on a grid of $h=(h_1,\ldots,h_d)$

$$\widehat{AASE}^{[r]}(h)=RSS(h^{[r-1]})\Big\{\int K^2(t)\,dt\Big\}\sum_{j=1}^d\frac{1}{nh_j}+\frac{1}{4n}\sum_{i=1}^nw(X^i)\Big\{\sum_{j=1}^d\hat m_j''(X_j^i;h^{[r-1]},g_j^{[r-1]})h_j^2\Big\}^2\Big\{\int t^2K(t)\,dt\Big\}^2$$

with $g_j^{[r-1]}=c\,h_j^{[r-1]}$ ($c=1.5$ or 2, say), and then find

$$h^{[r]}=\arg\min\widehat{AASE}^{[r]}(h).$$

A more sophisticated choice of $g_j$ suggested by the discussion at the end
of Section 3 would be $g_j=c\,h_j^{5/7}$ for some properly chosen constant
$c>0$.

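We also give an alternative algorithm to approximate $\hat h_{PL}$, which
requires only a one-dimensional grid search. This would be useful for very
high-dimensional covariates:

Step 0'. Initialize $h^{[0]}=(h_1^{[0]},\ldots,h_d^{[0]})$.

Step r'. For $j=1,\ldots,d$, compute on a grid of $h_j$

$$\widehat{AASE}^{[r]}(h_1^{[r]},\ldots,h_{j-1}^{[r]},h_j,h_{j+1}^{[r-1]},\ldots,h_d^{[r-1]}),$$

where, for a bandwidth vector $h=(h_1,\ldots,h_d)$,

$$\widehat{AASE}^{[r]}(h)=RSS(h^{[r-1]})\Big\{\int K^2(t)\,dt\Big\}\sum_{l=1}^d\frac{1}{nh_l}+\frac{1}{4n}\sum_{i=1}^nw(X^i)\Big\{\sum_{l=1}^d\hat m_l''(X_l^i;h^{[r-1]},g_l^{[r-1]})h_l^2\Big\}^2\Big\{\int t^2K(t)\,dt\Big\}^2,$$

and then find

$$h_j^{[r]}=\arg\min_{h_j}\widehat{AASE}^{[r]}(h_1^{[r]},\ldots,h_{j-1}^{[r]},h_j,h_{j+1}^{[r-1]},\ldots,h_d^{[r-1]}).$$

In the grid search for $h_j$ we use $h^{[r-1]}$ rather than
$(h_1^{[r]},\ldots,h_{j-1}^{[r]},h_j,h_{j+1}^{[r-1]},\ldots,h_d^{[r-1]})$ for
RSS and $\hat m_j''$. The reason is that the latter requires repetition of
the whole backfitting procedure (3.2) for every point on the grid of the
bandwidth. Thus, it is computationally much more expensive than our second
proposal for approximating $\hat h_{PL}$.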

Finally, we give an algorithm to approximate $\hat h^*_{PL,j}$. We suppose
calculation of the residuals by use of a backfitting estimate. This means
that we replace $n^{-1}\sum_{i=1}^nw_j(X_j^i)(\hat\varepsilon^i)^2$ in (3.15)
by RSS. Thus, $\hat h^*_{PL,j}$ is given by

$$\hat h^*_{PL,j}=n^{-1/5}\Big[RSS\times\int K^2(t)\,dt\Big]^{1/5}\Big[n^{-1}\sum_{i=1}^nw_j(X_j^i)\hat m_j''(X_j^i)^2\Big\{\int t^2K(t)\,dt\Big\}^2\Big]^{-1/5}. \tag{4.1}$$

We propose the following iterative procedure for $\hat h^*_{PL}$. Start with
some initial bandwidths $h_1,\ldots,h_d$ and calculate
$\hat m_1^{LL},\ldots,\hat m_d^{LL}$ with these bandwidths, and compute RSS.
Choose $g_j=c\,h_j$ (with $c=1.5$ or 2, say). Then calculate
$\hat m_1'',\ldots,\hat m_d''$ by (3.16). Plug RSS and the computed values of
$\hat m_j''(X_j^i)$ into (4.1), which defines new bandwidths
$h_1,\ldots,h_d$. Then the procedure can be iterated.

It was observed in the simulation study presented in Section 5 that the
iterative algorithms for approximating $\hat h_{PLS}$, $\hat h_{PL}$ and $\hat h^*_{PL}$ converge very
quickly. With the convergence criterion 10"'^ on the relative changes of the 
bandwidth selectors, the average (out of 500 cases) numbers of iterations 
for the three bandwidth selectors were 4.27, 6.30 and 5.23, respectively. The 
worst cases had eight iterations. 

5. Simulations. In this section we present simulations for the small sample
performance of the plug-in selectors $\hat h_{PL}$, $\hat h^*_{PL}$ and the
penalized least squares bandwidth $\hat h_{PLS}$. We will do this only for
local linear smooth backfitting.

Our first goal was to compare how much these bandwidths differ from their
theoretical targets. For this, we drew 500 datasets $(X^i,Y^i)$,
$i=1,\ldots,n$, with $n=200$ and 500 from the model

$$Y^i=m_1(X_1^i)+m_2(X_2^i)+m_3(X_3^i)+\varepsilon^i, \tag{M1}$$

where $m_1(x_1)=x_1^2$, $m_2(x_2)=x_2^3$, $m_3(x_3)=x_3^4$ and $\varepsilon^i$ are distributed as
$N(0, 0.01)$. The covariate vectors were generated from joint normal distributions
with marginals $N(0.5, 0.5)$ and correlations $(\rho_{12}, \rho_{13}, \rho_{23}) = (0,0,0)$
and $(0.5,0.5,0.5)$. Here $\rho_{ij}$ denotes the correlation between $X_i$ and $X_j$. If the
generated covariate vector was within the cube $[0,1]^3$, then it was retained
in the sample. Otherwise, it was removed. This was done until arriving at
the predetermined sample size 200 or 500. Thus, the covariate vectors follow
truncated normal distributions and have compact support $[0,1]^3$, satisfying
assumption (A2). Both the kernel $K$ that we used for the backfitting
algorithm and the kernel $L$ for estimating $m_j''$ by (3.16) were the biweight kernel
$K(u) = L(u) = (15/16)(1 - u^2)^2 I_{[-1,1]}(u)$. The weight function $w$ in (3.5)
and (3.6) was the indicator $I(u \in [0,1]^3)$, and $w_j$ in (3.13) and (4.1) was
$I(u_j \in [0,1])$.
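The covariate design and the biweight kernel described above can be sketched as follows. This is a minimal illustration under stated assumptions: the second parameter of $N(0.5, 0.5)$ is read as a variance, and the names `biweight` and `sample_covariates` are hypothetical.

```python
import numpy as np

def biweight(u):
    """Biweight kernel (15/16)(1 - u^2)^2 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def sample_covariates(n, rho, rng, d=3, mean=0.5, var=0.5):
    """Draw equicorrelated normal vectors and keep those inside [0, 1]^d."""
    # Assumption: 'var' is the marginal variance of each coordinate.
    cov = var * ((1.0 - rho) * np.eye(d) + rho * np.ones((d, d)))
    kept, total = [], 0
    while total < n:
        draw = rng.multivariate_normal(np.full(d, mean), cov, size=4 * n)
        inside = draw[np.all((draw >= 0.0) & (draw <= 1.0), axis=1)]
        kept.append(inside)
        total += len(inside)
    return np.vstack(kept)[:n]  # stop at the predetermined sample size

rng = np.random.default_rng(0)
X = sample_covariates(200, 0.5, rng)
```

Responses would then be formed by adding the component functions evaluated at the columns of `X` and independent normal errors.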

Kernel density estimates of the densities of $\log(\hat h_{PLS,j}) - \log(\hat h_{ASE,j})$,
$\log(\hat h_{PL,j}) - \log(\hat h_{ASE,j})$ and $\log(\hat h^*_{PL,j}) - \log(\hat h_{ASE,j})$ are overlaid in
Figures 1-3 for $j = 1,2,3$. The results are based on 500 replicates for the two
choices of the correlation values and of the sample sizes. The kernel den- 
sity estimates were constructed by using the standard normal kernel and 



Fig. 1. Densities of $\log(\hat h_1) - \log(\hat h_{ASE,1})$ constructed by the kernel method based on 500
pseudosamples. The long-dashed, dotted and dot-dashed curves correspond to $\hat h_1 = \hat h_{PLS,1}$,
$\hat h_{PL,1}$ and $\hat h^*_{PL,1}$, respectively.



the common bandwidth 0.12. The iterative procedures described in Section 4
for $\hat h_{PLS}$, $\hat h_{PL}$ and $\hat h^*_{PL}$ were used here. In all cases, the initial bandwidth
vector $(0.1, 0.1, 0.1)$ was used. For $\hat h_{PL}$, the first proposal with three-dimensional
grid search was implemented. We tried $g = 1.5h$ and $g = 2h$ to
estimate $m_j''$ in the iterative procedures. We found there is little difference

Fig. 2. Densities of $\log(\hat h_2) - \log(\hat h_{ASE,2})$ constructed by the kernel method based on 500
pseudosamples. The long-dashed, dotted and dot-dashed curves correspond to $\hat h_{PLS,2}$, $\hat h_{PL,2}$
and $\hat h^*_{PL,2}$, respectively.



between these two choices, and thus present here only the results for the 
case g = 1.5h. In each of Figures 1-3, the upper two panels show the den- 
sities of the log differences for the sample size n = 200, while the lower two 
correspond to the cases where n = 500. 
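The density curves in Figures 1-3 are kernel density estimates of log bandwidth ratios with a standard normal kernel and common bandwidth 0.12. A minimal sketch of that computation is given below; the function name `gaussian_kde_fixed` is hypothetical and the bandwidth values are synthetic placeholders, not simulation output from the paper.

```python
import numpy as np

def gaussian_kde_fixed(samples, grid, bandwidth=0.12):
    """Kernel density estimate with the standard normal kernel and a fixed bandwidth."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * bandwidth * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
# Synthetic stand-ins for 500 selected bandwidths and their ASE-optimal targets.
h_selected = 0.15 * np.exp(0.2 * rng.standard_normal(500))
h_ase = np.full(500, 0.15)
log_diff = np.log(h_selected) - np.log(h_ase)

grid = np.linspace(-2.0, 2.0, 801)
density = gaussian_kde_fixed(log_diff, grid)
```

A curve centered at zero indicates a selector that is unbiased for its target on the log scale; a curve shifted to the right indicates a selector that tends to oversmooth.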

Comparing the three bandwidth selectors $\hat h_{PLS}$, $\hat h_{PL}$ and $\hat h^*_{PL}$, one sees
that the penalized least squares bandwidth has the correct center while the

Fig. 3. Densities of $\log(\hat h_3) - \log(\hat h_{ASE,3})$ constructed by the kernel method based on 500
pseudosamples. The long-dashed, dotted and dot-dashed curves correspond to $\hat h_{PLS,3}$, $\hat h_{PL,3}$
and $\hat h^*_{PL,3}$, respectively.



two plug-in bandwidths are positively biased as estimators of $\hat h_{ASE}$. Furthermore,
$\hat h_{PLS}$ is less variable than $\hat h_{PL}$ and $\hat h^*_{PL}$ as an estimator of $\hat h_{ASE}$. This
shows that the penalized least squares approach is superior to the other two
methods in terms of estimating $\hat h_{ASE}$. We found, however, that $\hat h_{PL}$ and $\hat h^*_{PL}$ are
more stable and less biased as estimators of $h_{opt}$ and $h^*_{opt}$, respectively.
Table 1
Averages of $ASE(\hat h)$ and $ASE_j(\hat h)$ ($j = 1,2,3$) for $\hat h = \hat h_{PLS}$, $\hat h_{PL}$ and $\hat h^*_{PL}$, based on
500 pseudosamples

                                    $\hat h_{PLS}$   $\hat h_{PL}$             $\hat h^*_{PL}$
                                                     g = 1.5h   g = 2h     g = 1.5h   g = 2h

Average ASE     n = 200   ρ = 0       0.00251        0.00347    0.00350    0.00471    0.00478
                          ρ = 0.5     0.00247        0.00362    0.00367    0.00513    0.00521
                n = 500   ρ = 0       0.00130        0.00195    0.00199    0.00269    0.00277
                          ρ = 0.5     0.00133        0.00209    0.00213    0.00294    0.00303

Average ASE_1   n = 200   ρ = 0       0.00107        0.00131    0.00133    0.00169    0.00172
                          ρ = 0.5     0.00112        0.00150    0.00153    0.00207    0.00211
                n = 500   ρ = 0       0.00045        0.00063    0.00065    0.00084    0.00088
                          ρ = 0.5     0.00052        0.00076    0.00078    0.00103    0.00108

Average ASE_2   n = 200   ρ = 0       0.00104        0.00085    0.00085    0.00078    0.00078
                          ρ = 0.5     0.00100        0.00079    0.00079    0.00072    0.00072
                n = 500   ρ = 0       0.00044        0.00037    0.00037    0.00033    0.00033
                          ρ = 0.5     0.00047        0.00038    0.00038    0.00034    0.00034

Average ASE_3   n = 200   ρ = 0       0.00112        0.00079    0.00079    0.00073    0.00073
                          ρ = 0.5     0.00121        0.00090    0.00090    0.00086    0.00086
                n = 500   ρ = 0       0.00051        0.00038    0.00037    0.00034    0.00033
                          ρ = 0.5     0.00061        0.00050    0.00050    0.00047    0.00047


It is also interesting to compare the performance of the bandwidth selectors
in terms of the average squared error of the resulting regression estimator.
Table 1 shows the means (out of 500 cases) of the ASE and $ASE_j$
for the three bandwidth selectors. First, it is observed that $\hat h_{PLS}$ yields
the smallest ASE. This means that $\hat h_{PLS}$ is most effective for estimating the
whole regression function. For the accuracy of each one-dimensional component
estimator, none of the bandwidth selectors dominates the others in
all cases. For $ASE_1$, the penalized least squares bandwidth does the best,
while for $ASE_2$ and $ASE_3$ the plug-in $\hat h^*_{PL}$ shows the best performance.
The backfitting estimates the centered true component functions because of
the normalization (3.4). Thus, $\hat m_j^{LL}(x_j)$ estimates $m_j(x_j) - E m_j(X_j)$, not
$m_j(x_j)$. We used these centered true functions to compute $ASE_j$.

Table 1 also shows that the means of the average squared errors are reduced
approximately by half when the sample size is increased from 200 to
500. Although not reported in the table, we computed $E(\hat h_{j,200})/E(\hat h_{j,500})$
for the three bandwidth selectors, where $\hat h_{j,n}$ denotes the bandwidth selector
for the $j$th component from a sample of size $n$. We found that these
values vary within the range (1.20, 1.26), which is roughly $(200/500)^{-1/5}$.
This means that the assumed rate $O(n^{-1/5})$ for the bandwidth selectors actually
holds in practice. We also note that increasing the correlation from 0 to 0.5
does not much worsen the means of the ASE and $ASE_j$. However, we
found in a separate experiment that in the more extreme case of $\rho_{ij} = 0.9$ the
means of the ASE and $ASE_j$ are increased by a factor of 3 or 4. In another
separate experiment where the noise level is 0.1, that is, the errors are generated
from $N(0, 0.1)$, we observed that the means of the ASE and $ASE_j$
are increased by a factor of 3 or 4, too. The main lessons on the comparison of
the three bandwidth selectors from these two separate experiments are the
same as in the previous paragraph.
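The two rate statements above can be checked with a short calculation. The sketch below assumes the usual consequence of bandwidths of order $n^{-1/5}$, namely an average squared error of order $n^{-4/5}$; the variable names are illustrative, and the two ASE values are the $\hat h_{PLS}$ entries of Table 1 for $\rho = 0$.

```python
n_small, n_large = 200, 500

# Bandwidths of order n^(-1/5): expected ratio of bandwidths at n = 200 versus n = 500.
bandwidth_ratio = (n_small / n_large) ** (-1 / 5)

# Assumed ASE rate n^(-4/5): expected ratio of ASE at n = 500 versus n = 200.
ase_ratio_theory = (n_large / n_small) ** (-4 / 5)

# Observed ratio from Table 1 (average ASE for the penalized least squares bandwidth, rho = 0).
ase_ratio_table = 0.00130 / 0.00251

print(round(bandwidth_ratio, 3), round(ase_ratio_theory, 3), round(ase_ratio_table, 3))
```

The bandwidth ratio is about 1.20, at the lower end of the reported range (1.20, 1.26), and both ASE ratios are close to one half.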

Figure 4 visualizes the overall performance of the backfitting for the three
bandwidth selectors. For each $h = \hat h_{ASE}$, $\hat h_{PLS}$, $\hat h_{PL}$, $\hat h^*_{PL}$, we computed
$ASE(h)$ and $ASE_j(h)$ for 500 datasets and arranged the 500 values of $d =
ASE(h)$ or $ASE_j(h)$ in increasing order. Call them $d_{(1)} \le d_{(2)} \le \cdots \le d_{(500)}$.
Figure 4 shows the quantile plots $\{i/500, d_{(i)}\}_{i=1}^{500}$ for the case where $n = 500$
and $\rho_{ij} = 0.5$. The bandwidth $g = 1.5h$ was used in the pilot estimation step
for the two plug-in bandwidths. The figure reveals that the quantile function
of $ASE(h)$ for $h = \hat h_{PLS}$ is consistently below those for the two plug-in
rules and is very close to that for $h = \hat h_{ASE}$. For $ASE_j(h)$, none of the three
bandwidth selectors dominates the other two for all $j$, a result also seen
in Table 1, but in any case the quantile function of $ASE_j(\hat h_{PLS})$ is closest
to that of $ASE_j(\hat h_{ASE})$. We note that the quantile functions of $ASE_j(\hat h_{ASE})$
are not always the lowest since $\hat h_{ASE} = (\hat h_{ASE,1}, \hat h_{ASE,2}, \hat h_{ASE,3})$ does not
minimize each component's $ASE_j$.
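The quantile plots described above pair each order statistic with its empirical level. A minimal sketch, with a hypothetical helper `quantile_points` and toy values in place of the 500 simulated ASE values:

```python
import numpy as np

def quantile_points(values):
    """Return the levels i/n and the order statistics d_(1) <= ... <= d_(n)."""
    d_sorted = np.sort(np.asarray(values, dtype=float))
    n = d_sorted.size
    levels = np.arange(1, n + 1) / n
    return levels, d_sorted

# Toy input standing in for ASE values from repeated datasets.
levels, d_sorted = quantile_points([0.003, 0.001, 0.002, 0.004])
```

Plotting `d_sorted` against `levels` for each selector on common axes gives curves of the kind compared in Figure 4; a selector whose curve lies below another has smaller error at every quantile level.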

Asymptotic theory says that in first order the accuracy of the backfitting 
estimate does not decrease with increasing number of additive components. 
And this also holds for the backfitting estimates with the data-adaptively 
chosen bandwidths. We wanted to check if this also holds for finite sam- 
ples. For this purpose we compared our model (Ml) with three additive 
components with a model that has only one component, 

(M2) $Y^i = m_1(X_1^i) + \varepsilon^i.$

We drew 500 datasets of sizes 200 and 500 from the models (Ml) and (M2). 
The errors and the covariates at the correlation level 0.5 were generated in 
the same way as described in the second paragraph of this section. The 
penalized least squares bandwidth for the single covariate case, denoted by
$\hat h_{PLS(1)}$, was obtained by minimizing

$$PLS_1(h_1) = RSS_1(h_1)\Bigl\{1 + \frac{2K(0)}{n h_1}\Bigr\},$$

where $RSS_1(h_1) = n^{-1}\sum_{i=1}^n \{Y^i - \tilde m_1^{LL}(X_1^i; h_1)\}^2$ and $\tilde m_1^{LL}(\cdot\,; h_1)$ is the ordinary
local linear fit for model (M2) with bandwidth $h_1$. The plug-in bandwidth
selector, $\hat h_{PL(1)}$, for the single covariate case was obtained by a formula






Fig. 4. Quantile functions of $ASE(h)$ and $ASE_j(h)$ for $\hat h_{ASE}$ and the three bandwidth
selectors. Solid, long-dashed, dotted and dot-dashed curves correspond to $\hat h_{ASE}$, $\hat h_{PLS}$, $\hat h_{PL}$
and $\hat h^*_{PL}$, respectively. The sample size was $n = 500$ and the correlations between the
covariates were all 0.5.



similar to the one in (4.1), where RSS is replaced by $RSS_1$ and $\tilde m_1^{LL}$, instead
of the backfitting estimate $\hat m_1^{LL}$, is used to calculate the local quadratic
estimate of $m_1''$. For $\hat h_{PL(1)}$, an iterative procedure similar to those described
in Section 4 was used here, again with the choice $g = 1.5h$. Table 2 shows
$E\{ASE_1(h)\}$ for $h = \hat h_{PLS}$, $\hat h_{PLS(1)}$, $\hat h_{PL}$, $\hat h^*_{PL}$ and $\hat h_{PL(1)}$. Also, it gives the
Table 2
Averages of $ASE_1(h)$ as an error criterion for estimating the first
component $m_1$, based on 500 pseudosamples from the models (M1)
and (M2)

                       $\hat h_{PLS}$   $\hat h_{PLS(1)}$   $\hat h_{PL}$   $\hat h^*_{PL}$   $\hat h_{PL(1)}$

n = 200   ρ = 0          0.00107          0.00034           0.00131         0.00169           0.00029
                         (3.147)                            (4.517)         (5.828)
          ρ = 0.5        0.00112          0.00033           0.00150         0.00207           0.00028
                         (3.394)                            (5.357)         (7.393)
n = 500   ρ = 0          0.00045          0.00015           0.00063         0.00084           0.00014
                         (3.000)                            (4.500)         (6.000)
          ρ = 0.5        0.00052          0.00014           0.00076         0.00103           0.00013
                         (3.714)                            (5.846)         (7.923)

Also given in the parentheses are the relative increases of $E\{ASE_1(h)\}$
due to the increased dimension of the covariates. The choice $g = 1.5h$
was used for the plug-in rules.



relative increases of $E\{ASE_1(h)\}$ due to the increased dimension of the covariates.
For $\hat h_{PLS(1)}$ and $\hat h_{PL(1)}$ the one-dimensional local linear estimate
$\tilde m_1^{LL}$ and the noncentered regression function $m_1$ were used to compute the
values of $ASE_1$.
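For the single covariate case, the penalized least squares selector can be sketched as a grid search. This is an illustration under assumptions: the penalty factor $1 + 2K(0)/(n h_1)$ is our reading of the displayed criterion, the residual sum of squares is averaged over all observations, the regression function and data below are synthetic, and the names `local_linear_fit` and `pls1` are hypothetical.

```python
import numpy as np

def biweight(u):
    """Biweight kernel (15/16)(1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def local_linear_fit(x, y, h, kernel=biweight):
    """Ordinary local linear fit evaluated at the design points."""
    fitted = np.empty(len(x), dtype=float)
    for i, x0 in enumerate(x):
        diff = x - x0
        w = kernel(diff / h)
        s0, s1, s2 = w.sum(), (w * diff).sum(), (w * diff**2).sum()
        t0, t1 = (w * y).sum(), (w * diff * y).sum()
        fitted[i] = (s2 * t0 - s1 * t1) / (s0 * s2 - s1**2)  # intercept of weighted LS line
    return fitted

def pls1(x, y, h, kernel=biweight):
    """Penalized least squares criterion RSS_1(h) * {1 + 2 K(0) / (n h)} (assumed form)."""
    rss = np.mean((y - local_linear_fit(x, y, h, kernel)) ** 2)
    return rss * (1.0 + 2.0 * float(kernel(0.0)) / (len(x) * h))

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 200)                                  # synthetic design
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(200)    # synthetic responses
h_grid = np.linspace(0.05, 0.5, 46)
scores = np.array([pls1(x, y, h) for h in h_grid])
h_pls1 = h_grid[np.argmin(scores)]
```

The penalty grows as $h$ shrinks, which offsets the tendency of the residual sum of squares alone to favor undersmoothing.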

From Table 2, it appears that the increased dimension of the covariates has
a considerable effect on the regression estimates. The relative increase
of $ASE_1$ for the penalized least squares bandwidth is smaller than those for
the plug-in rules, however. Also, one observes higher rates of increase for
the correlated covariates. An interesting fact is that $\hat h_{PL(1)}$ is slightly better
than $\hat h_{PLS(1)}$ in the single covariate case. The results for the other component
functions, which are not presented here, showed the same qualitative
picture.
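The parenthesized entries of Table 2 are ratios of the three-covariate error to the corresponding single-covariate error. The short check below reproduces them for $n = 200$ and $\rho = 0$; the dictionary keys are illustrative labels, and the pairing of $\hat h^*_{PL}$ with $\hat h_{PL(1)}$ is inferred from the fact that it reproduces the tabulated value.

```python
# Averages of ASE_1 from Table 2 (n = 200, rho = 0).
ase1_three_covariates = {"PLS": 0.00107, "PL": 0.00131, "PL*": 0.00169}
# Single-covariate counterparts: h_PLS(1) for PLS, h_PL(1) for both plug-in rules.
ase1_one_covariate = {"PLS": 0.00034, "PL": 0.00029, "PL*": 0.00029}

relative_increase = {
    name: ase1_three_covariates[name] / ase1_one_covariate[name]
    for name in ase1_three_covariates
}
print({name: round(value, 3) for name, value in relative_increase.items()})
```

The ratios come out as about 3.147, 4.517 and 5.828, matching the first row of parenthesized values.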

6. Assumptions, auxiliary results and proofs. 

6.1. Assumptions. We use the following assumptions. 

(A1) The kernel $K$ is bounded, has compact support ($[-1,1]$, say), is symmetric
about zero and is Lipschitz continuous, that is, there exists a
positive finite constant $C$ such that $|K(t_1) - K(t_2)| \le C|t_1 - t_2|$.

(A2) The $d$-dimensional vector $X^i$ has compact support $I_1 \times \cdots \times I_d$ for
bounded intervals $I_j$, and its density $p$ is bounded away from zero and
infinity on $I_1 \times \cdots \times I_d$. The tuples $(X^i, \varepsilon^i)$ are i.i.d.

(A3) Given $X^i$ the error variable $\varepsilon^i$ has conditional zero mean, and for some
$\gamma > 4$ and $C' < \infty$

$E[|\varepsilon^i|^\gamma \mid X^i] < C'$ a.s.

(A4) The functions $m_j''$, $p_j'$ and $(\partial/\partial x_j)p_{jk}(x_j, x_k)$ ($1 \le j,k \le d$) exist and
are continuous.

(A5) The kernel L is twice continuously differentiable and has bounded 
support ([-1,1], say). 

6.2. Auxiliary results. In this section we will give higher-order expansions
of $\hat m_j^{NW}$ and $\hat m_j^{LL}$. These expansions will be used in the proofs of
Theorems 2.1, 2.2 and 3.3. The expansions given in [9] are only of order
$o_p(n^{-2/5})$. Furthermore, they are not uniform in $h$. For the proof of
our results we need expansions of order $o_p(n^{-1/2})$. First, we consider the
Nadaraya-Watson smooth backfitting estimate $\hat m_j^{NW}$.

As in [9] we decompose fh^^ into 

{Uj)=m- [Uj)+7nj [uj), 
where fh^^'^ {S = A, B) is defined by 
m- [uj) = m- [uj) 

■NW,A_ is-^n i -~.NW,B _ i^n \X^d 



where fh'^ "^'^ = n"! ELi ^^ ' = ELil^^o + Ei=i "b' (^j)} and 

n 



n 

~NW,A/ 

1=1 



n ( d ^ 

fhf^'^{uj)=pj{ujy^n~^J2 {uj ,X'j)l mo + ^ (X] ) L 

i=l I j=l J 

Here fn^^'^ and fn^^'^ are related to the sum of the true function and 
the bias, whereas fh^^'^ and rh^^'^ represent the "stochastic" part. In 
particular, fn^^'^ and ffi^^'^ do not depend on the error variables. 
We now state our stochastic expansions of fh^'^'^ and m^^'^ . 

Theorem 6.1. Suppose that the assumptions of Theorem 2.1 apply, and
that $\hat m_j^{NW,A}$ and $\hat m_j^{NW,B}$ are defined according to (6.1). Then there exist
random variables $R_{n,i,j}(u_j,h,X)$, depending on $0 \le u_j \le 1$, $h = (h_1,\dots,h_d)$
and $X = (X^1,\dots,X^n)$ (but not on $\varepsilon$), such that

n 

(6.2a) m^^'^(nj) =fhf^'^{uj) + Rn,ij{uj,h,X)e\ 

1=1 
(6.2b) $\sup_{0 \le u_j \le 1}\ \sup_{C_H n^{-1/5} \le h_1,\dots,h_d \le C_H' n^{-1/5}} |R_{n,i,j}(u_j,h,X)| = O_p(1),$

(6.2c) $\sup_{0 \le u_j \le 1}\ \sup_{C_H n^{-1/5} \le h_1,h_1',\dots,h_d,h_d' \le C_H' n^{-1/5}} |R_{n,i,j}(u_j,h',X) - R_{n,i,j}(u_j,h,X)| = \sum_j |h_j' - h_j|\,O_p(n^{\alpha})$ for some $\alpha > 0$.

Furthermore, uniformly for $C_H n^{-1/5} \le h_1,\dots,h_d \le C_H' n^{-1/5}$ and $0 \le u_j \le 1$,



(6.3) m^^'^{uj) =m^^'^{uj) +n ^'^rij{uj)e^ + Op{n -^/^), 

i=l 

(6.4) fnf ^'^(TXj) = mjiuj) + Op{n~^/^), 

where $r_{ij}$ are absolutely uniformly bounded functions with

(6.5) $|r_{ij}(u_j') - r_{ij}(u_j)| \le C|u_j' - u_j|$

for a constant $C > 0$. In particular, uniformly for $C_H n^{-1/5} \le h_j \le C_H' n^{-1/5}$
and $h_j \le u_j \le 1 - h_j$,

(6.6) fh^^'^{uj) = mj{uj) + I3j{h, uj) + Op(n"^/^), 
where f3j is chosen so that 



Pj {h, Uj )pj {uj ) duj 

= ^h'j J [m'j{xj)pj{xj) + ^mj{xj)pj{xj)] dxj J u^K{u) du. 

This choice is possible because of J (3{h,x)p{x) dx = — Z)j=i7nj- 

We now come to the local linear smooth backfitting estimate. For a the- 
oretical discussion, we now decompose this estimate into a stochastic and a 
deterministic term. For S = A,B, define rhj^'^ by 



■ ^LL,S, N \ /^LLS\ / ~LL,S/ 

[ ^''^\ ]=-( "^0^ ^ + I ""A ' 

m 
n

,LL,S 



(6.8) mf'^ = n"^^y^.^ 



i=l 



fhj ' {uj)pj{uj)duj + j fh- {uj)pj{uj)duj = 0, 

(6-9) J = l,...,d, 

where Y^'^ = for S = A and niQ + I]j=i "^j(^j) for S = B. Furthermore, 
fhj^'^ and rh^^'^'^ are the local linear estimates of the function itself and 
its first derivative, respectively, for the regression of (for S = A) or mo + 
mi{X{) + ■■■ + nidiXdY (for S = B) onto Xj. 

For the local linear smooth backfitting estimate, we get the following 
stochastic expansions. 

Theorem 6.2. Suppose that the assumptions of Theorem 3.1 apply, and 
that m J ' andfh- ' ' {s = A,B) are defined according to {6. 7)-{6. 9). Then 
there exist random variables R^^ j{uj,h,X) such that 



(6.10a) rhf^'^iuj) = mf^'^{uj) + 7i~^ RnA,jiuj,h, X)e\ 

i=l 



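(6.10b) $\sup_{0 \le u_j \le 1}\ \sup_{C_H n^{-1/5} \le h_1,\dots,h_d \le C_H' n^{-1/5}} |R^{LL}_{n,i,j}(u_j,h,X)| = O_p(1),$

(6.10c) $\sup_{0 \le u_j \le 1}\ \sup_{C_H n^{-1/5} \le h_1,h_1',\dots,h_d,h_d' \le C_H' n^{-1/5}} |R^{LL}_{n,i,j}(u_j,h',X) - R^{LL}_{n,i,j}(u_j,h,X)| = \sum_j |h_j' - h_j|\,O_p(n^{\alpha})$ for some $\alpha > 0$.

Furthermore, uniformly for $C_H n^{-1/5} \le h_1,\dots,h_d \le C_H' n^{-1/5}$ and $0 \le u_j \le 1$,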



(6.11) mf '^(n,) = mf '^(^z,) + n"^ J2 ^'^i^jV + Op{n~'/'), 



^LL,Ar \ ~LL,A, 

Uj ) -f- n 

i=l 

(6.12) mj'^'^(nj) = mj{uj) + Op{n~^'^), 

where $r_{ij}^{LL}$ are absolutely uniformly bounded functions that satisfy the Lipschitz
condition (6.5). Furthermore, uniformly for $C_H n^{-1/5} \le h_j \le C_H' n^{-1/5}$
and $h_j \le u_j \le 1 - h_j$, we have

(6.13) $\hat m_j^{LL,B}(u_j) = m_j(u_j) + \tfrac{1}{2} m_j''(u_j) h_j^2 \int t^2 K(t)\,dt + o_p(n^{-2/5}).$
6.3. Proofs.

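Proof of Theorem 6.1. For an additive function $f(x) = f_1(x_1) + \cdots + f_d(x_d)$
we define

$$\hat\Psi_j f(x) = f_1(x_1) + \cdots + f_{j-1}(x_{j-1}) + f_j^*(x_j) + f_{j+1}(x_{j+1}) + \cdots + f_d(x_d),$$

where

$$f_j^*(x_j) = -\sum_{k \ne j} \int f_k(x_k)\,\frac{\hat p_{jk}(x_j,x_k)}{\hat p_j(x_j)}\,dx_k + \sum_k \int f_k(x_k)\,\hat p_k(x_k)\,dx_k.$$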



According to Lenima 3 in [9], we have for m^^''^(x) = ttiq ' +fhi ' {xi) + 

oo 
s=0 

Here, f = $rf • • • $i and 

9{x) = $d • • • §2[mf ^'"^(x) - m^r^] + • • • + ^d[rh^I^'^{x) - m^J_:f] 
, ~NW,Ar \ ~NW,A 

where, in a shght abuse of notation, fhj{x) = mj{xj) and ?no,j = / iTij{xj) x 
Pj{xj) dxj. 

We now decompose 

oo oo 

(6.14) m^^'^(x) = m^^'^(x) + £ f ''(f - m^^'^)(x) + £ f ^m^^'^(x), 

s=0 s=l 

where $\tilde m^{NW,A}(x) = \tilde m_1^{NW,A}(x_1) + \cdots + \tilde m_d^{NW,A}(x_d)$. We will show that there
exist absolutely bounded functions $a^i(x)$ with $|a^i(x) - a^i(y)| \le C\|x - y\|$ for
a constant $C$ such that

oo n 

(6.15) T'm^^^'^ix) = n-^ ^ a'{x)e' + Op(n-^/2) 

uniformly for $C_H n^{-1/5} \le h_j \le C_H' n^{-1/5}$ and $0 \le x_j \le 1$. A similar claim
holds for the second term on the right-hand side of (6.14). This immediately
implies (6.3).

For the proof of (6.15) we show that there exist absolutely bounded functions
$b^i$ with $|b^i(x) - b^i(y)| \le C\|x - y\|$ for a constant $C$ such that

n 

(6.16) f m^^'^(x) = ^ b\xy + Op(n-i/2), 

i=l 

oo oo 

(6.17) ^™^^'^(x) = ^TTO^^'^(x) + Op(n-i/2). 
Here $T = \Psi_d \cdots \Psi_1$ and

$$\Psi_j f(x) = f_1(x_1) + \cdots + f_{j-1}(x_{j-1}) + f_j^{**}(x_j) + f_{j+1}(x_{j+1}) + \cdots + f_d(x_d)$$

for an additive function $f(x) = f_1(x_1) + \cdots + f_d(x_d)$ with

$$f_j^{**}(x_j) = -\sum_{k \ne j} \int f_k(x_k)\,\frac{p_{jk}(x_j,x_k)}{p_j(x_j)}\,dx_k + \sum_k \int f_k(x_k)\,p_k(x_k)\,dx_k.$$

Note that (6.15) follows immediately from (6.16) and (6.17), since

oo oo 

^ fs^NWA^^-) = ^ r^fm^^-^(:r) + Op(n-^/2) 
s=l s=0 

n r cxD 

(x)e^ + Op(n-V2)^ 



1=1 



We prove (6.16) first. For this purpose, one has to consider terms of the 
form 



n 

i=l 



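We make use of the following well-known facts:

(6.18) $\hat p_{jk}(x_j,x_k) = E\{\hat p_{jk}(x_j,x_k)\} + O_p(n^{-3/10}\sqrt{\log n}),$

(6.19) $\hat p_j(x_j) = E\{\hat p_j(x_j)\} + O_p(n^{-2/5}\sqrt{\log n}),$

(6.20) $(\partial/\partial x_j)\hat p_{jk}(x_j,x_k) = E\{(\partial/\partial x_j)\hat p_{jk}(x_j,x_k)\} + O_p(n^{-1/10}\sqrt{\log n}),$

(6.21) $(\partial/\partial x_j)\hat p_j(x_j) = E\{(\partial/\partial x_j)\hat p_j(x_j)\} + O_p(n^{-1/5}\sqrt{\log n}),$

uniformly for $C_H n^{-1/5} \le h_j, h_k \le C_H' n^{-1/5}$ and $0 \le x_j, x_k \le 1$, $1 \le j,k \le d$.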

We now argue that 



Pjkixj,Xi) 



-e 



fiPj{xj)Pk{Xl) 

(6.22) 

J2 ^kj {xj , hj , hk)e' = Op{n-^^^), 



n 

— 1 ■ 

: n 

i=l 



uniformly in Xj,hj,hk- From (6.18)-(6.21) and from the expansions of the 
expectations on the right-hand sides of these equations we get 

Akjixj,hj,hk) = Op{n-^/^), 
uniformly in Xj,hj, h^. Furthermore, we have, because of S[|e*|(^/2)+5| j^i] < 
C for some > 0, C < +00, that for a sequence c„ — > 

E[\e'\l{\e'\>n^/^)\X'] < CnU"'^/^, 

P{\e'\ <r?l^ for l<i<n)^l. 

This shows that 

n n 

(6.23) n^^^Akj{xj,hj,hk)e' -n~^^Akj{xj,hj,hk)el = Op{n~^/'^) 

i=l i=l 

uniformly in Xj , hj , hk , where 

ei = e'l{\e'\<n^/^) - E[eh{\e'\<n^^^)\X% 
Note now that, with X = {X^, . . . , X") and A = n^^^ ^^Pk,j,xj,hj,ht I'^kjixjjh 



'J' 



1=1 



< E 



X 



exp(-ni/iO) 



expl J2 ^kj{xj,hj,hk)ei 

[ i=i 

< n ^{ exp{n-3/iOAfc,(x,>j, /ifc)e:}|X} exp(-nV^O) 

i=l 

n 



i=l 

X exp(-n^/^°) 



<exp|A2 sup ^[(4)2|X*]exp(2n-i/^°A)lexp(-ni/i°) 

I l<i<n J 

<M„exp(-ni/i°) 

with a random variable $M_n = O_p(1)$. Together with (6.23) this inequality
shows that (6.22) holds uniformly on any grid of values of $x_j$, $h_j$ and $h_k$ whose
cardinality is of polynomial order in $n$. For $a_1, a_2 > 0$ large enough and
for a random variable $R_n = O_p(1)$, one can show

\Akj{x'j,h'j,h'^) - Akj{xj,hj,hk)\ 

<Rn{n'''\x'j- Xj I + 1 /^z _ /j^. | + | /i'^ -hk\). 

This implies that (6.22) holds uniformly for < Xj < 1 and CHn~^^^ < 
hj,hk < C^n~^/^. By consideration of other terms similar to Skj{xj), one 
may complete the proof of (6.16). 
We now come to the proof of (6.17). With the help of (6.18)-(6.21) one
can show by using the Cauchy-Schwarz inequality that

(6.24) $\sup_{\|f\| \le 1}\ \sup_{0 \le x_1,\dots,x_d \le 1} |\hat T f(x) - T f(x)| = O_p(n^{-1/10}\sqrt{\log n}).$

Here the first supremum runs over all additive functions $f$ with $\int f^2(x)p(x)\,dx \le 1$.
(The slow rate is caused by the fact that $\hat p_j$ is inconsistent at $x_j$ in neighborhoods
of 0 and 1.) Furthermore, in [9] it has been shown that

(6.25) $\sup_{\|f\| \le 1}\ \sup_{0 \le x_1,\dots,x_d \le 1} |\hat T f(x)| = O_p(1),$

(6.26) $\sup_{\|f\| \le 1} \|T f\| < 1,$

where $\|T f\|^2 = \int \{T f(x)\}^2 p(x)\,dx$. Claim (6.17) now follows from (6.16),
(6.24)-(6.26) and the fact

$$\sum_{s=1}^{\infty} (\hat T^s - T^s) = \sum_{s=1}^{\infty} \sum_{t=0}^{s-1} \hat T^t (\hat T - T)\, T^{s-1-t}.$$
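The last step of the argument rests on a telescoping identity for powers of two operators, which we read as $\hat T^s - T^s = \sum_{t=0}^{s-1} \hat T^t(\hat T - T)T^{s-1-t}$. The following sketch checks this identity numerically for small random matrices; the matrices are arbitrary stand-ins for the operators, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 0.1 * rng.standard_normal((4, 4))           # stand-in for the population operator
T_hat = T + 0.01 * rng.standard_normal((4, 4))  # stand-in for its empirical version

def power_difference(A, B, s):
    """Left-hand side: A^s - B^s."""
    return np.linalg.matrix_power(A, s) - np.linalg.matrix_power(B, s)

def telescoped(A, B, s):
    """Right-hand side: sum over t of A^t (A - B) B^(s-1-t)."""
    return sum(
        np.linalg.matrix_power(A, t) @ (A - B) @ np.linalg.matrix_power(B, s - 1 - t)
        for t in range(s)
    )

identity_holds = all(
    np.allclose(power_difference(T_hat, T, s), telescoped(T_hat, T, s))
    for s in range(1, 7)
)
```

Each summand contains one factor $\hat T - T$, so a bound on that difference, combined with bounds on the powers of the two operators, controls the whole series.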

Proof of (6.2a)-(6.2c). Formula (6.2a) is given by the definition of
$\hat m^{NW,A}$. Claim (6.2b) follows as in the proof of (6.3). For the proof of (6.2c)
one uses bounds on the operator norm of $\hat T_{h'} - \hat T_h$, where $\hat T_h$ is defined as $\hat T$
with bandwidth tuple $h$. □



Proof of (6.4) and (6.6). These claims follow by a slight modification
of the arguments used in the proof of Theorem 4 in [9]. There it has been
shown that (6.6) holds for fixed bandwidths $h_1,\dots,h_d$ and uniformly for $u_j$
in a closed subinterval of (0,1). The arguments can be easily modified to
get uniform convergence for $h_j \le u_j \le 1 - h_j$ and $C_H n^{-1/5} \le h_1,\dots,h_d \le
C_H' n^{-1/5}$. In Theorem 4 in [9] a wrong value was given for $\gamma_{n,j}$; see the
wrong proof of (114) in [9]. A correct calculation gives $\gamma_{n,j}$ as stated here.
□



The proof of Theorem 6.1 is complete. □ 



Proof of Theorem 6.2. Theorem 6.2 follows with similar arguments 
as in the proof of Theorem 6.1. Now one can use Theorem 4' of [9]. For 
the proof of (6.13) note that we use another norming for fhj (cf. (6.9) 
with (52) in [9]). Formula (6.13) follows from Theorem 4' of [9] by not- 
ing that / fhj^'^'^ (uj) X Pj{uj) duj = —^n,j + op{ti?^^) with defined as 
in Theorem 4' of [9]. □ 
Proof of Theorem 2.1. With Wi = i(;(X^)l(C^n"i/5< < 
for 1 < j < d), we get 

-in Q n 

RSS{h) - ASE{h) = -ywi{e'f - -Y Wi{m^^ {X') - m{X')]e\ 

n n ^ 

1=1 1=1 

where m^^(x) = ffiQ^ + fh'^^ (xi) + • • • + fh^^ (x^) and m(x) = niQ + 
mi{xi) + - ■ ■ + ma(xd)- We will show that uniformly for Cj/n^^/^ l^hi, . . . ,hfi< 

c'^n-y\ 

(6.27) -ywi{m^^^^{X') - m{X')}e' =oJn-^'^) 

n 

1=1 



and 



(6.28) 



-y^i;,m^^'^(X*)e* 

n ^ 

1=1 



1 " 1 

irt . ^ . _, ?T it ■J 

where for S = A,B we write 



m ' {x)=mQ ' (a^^i) H \-m^ [Xd)- 

The statement of Theorem 2.1 immediately follows from (6.27) and (6.28). 

For the proof of (6.27) one can proceed similarly as in the proof of (6.22). 
Note that 

sup sup n^/^\wi{m^^'^{X') - m{X')}\ = Op(l), 

l<j<nc'^„-i/5</j^ /i^<C^„-i/5 

and that differences between values of {?n^^'^(X*) — ?n(X*)} evaluated 
for different bandwidth tuples {h'l, . . . , h'^) and {hi, . . . , hj) can be bounded 
by X]j Wj ~ hj\Op{n'^) with a large enough. 

For the proof of (6.28) we note first that by application of Theorem 6.1, 



1=1 



-I n -t n d 

-5]u;,m^^'^(X*)e* + ^ ^ ^ u;,i2„,fc,,(X], /i, X)eV^ 

i=l i,k=lj=l 



-. n d 



i=ij=i 
-. n d 

+ — E E mRn,iA^j^h, X){e')^ 
^ i=ij=i 

1 

+ ;^ E E ^iRn,k,j i^j ,h,xye'' 
= Ti{h) + ---+n{h). 

Now, it is easy to check that uniformly for Cijn"^/^ l^hi^ . . . ,h(i< C'^n~^/^ , 




\T,{h)\<0,{l)-Y.^e^? = Op{n~'). 
So, it remains to show 

(6.29) T2{h) = 0p{n-^/'), 

(6.30) T^{h) = Op{n-^/^). 

We will show (6.29). Claim (6.30) follows by slightly simpler arguments.
For (6.29) it suffices to show that for $1 \le j \le d$

(6.31) Tl^ih) ^l^Y.^,hj'K[hj\X}-X^Mr,'' = Op{n-''/^), 

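where $\eta^i = \varepsilon^i I(|\varepsilon^i| \le n^{\alpha}) - E[\varepsilon^i I(|\varepsilon^i| \le n^{\alpha}) \mid X^i]$ with $1/\gamma < \alpha < 1/4$. The
constant $\gamma$ was introduced in assumption (A3). It holds that $E|\varepsilon^i|^{\gamma} < C'$ for
some $C' < \infty$; see assumption (A3). Note that

$P(|\varepsilon^i| > n^{\alpha}$ for some $i$ with $1 \le i \le n) \le n E|\varepsilon^1|^{\gamma} n^{-\alpha\gamma} \to 0,$

$E[|\varepsilon^i| I(|\varepsilon^i| > n^{\alpha}) \mid X^i] \le E[|\varepsilon^i|^{\gamma} \mid X^i]\, n^{-(\gamma-1)\alpha} \le C' n^{-(\gamma-1)\alpha} = O(n^{-3/4}).$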

We apply an exponential inequality for U -statistics. Let 

kI = E{2-\w, + WkW^^K[hf{Xi - Xj=)]r?*7?n', 

Mn = snv{2-\wi + WkW^''K[h-\X] - Xj=)]??S'}, 

where the supremum in the definition of Mn is over the whole probability 
space. We note that = 0(1) and M„ is bounded by a constant which is 
0{'n?"n^^^^). According to Theorem 4.1.12 in [2], for constants ci,C2 > and 
0<5< i - 2a, 



P{\T*{h)\>n-'/'-') 



< P 



n 



>CHn 



I/IO-S 



< ci exp 



C2n 



l/10-(5 



Kn + {M„n(VlO-5)/2^-l/2}2/3_ 

This gives with p= (1 — 26 — 4q!)/3 > and a constant C3 > 0, 
P{\Tl^ih)\>n~^/''~') < ciexp(-C3nO. 

Together with $|T^*_{2j}(h') - T^*_{2j}(h)| \le c n^{a} |h_j' - h_j|$ for $c, a > 0$ large enough,
this implies (6.31). □

Proof of Theorem 2.2. Claim (2.11a) follows from the expansions of
Theorem 6.1. For the proof of (2.11b) note that

$$RSS(h) = n^{-1} \sum_{i=1}^{n} w(X^i)(\varepsilon^i)^2 + o_p(1)$$

because of (2.11a) and Theorem 2.1. □

Proof of Theorems 3.1-3.3. Theorems 3.1 and 3.2 follow with sim- 
ilar arguments as in the proofs of Theorems 2.1 and 2.2. For the proof of 
Theorem 3.3, one uses 



(6.32) 



sup \'mj^{vj) — mj{vj) \ = Op{n ^/^\/logn). 



This can be shown by use of the expansions of Theorem 6.2. By standard 
arguments in local polynomial regression (see [3], e.g.), it follows that for 
mj(iij) defined in (3.16), 

mj{uj) — m!-{uj) 



2gf jL*[gj\v,-u,)] 

X {■m^^{vj) — mj{uj) — m'j{uj){vj — Uj) 



^mj{uj){vj -Uj)'^}dvj, 

where $L^*$ is the so-called equivalent kernel having the property that $\int L^*(v_j)\,dv_j =
\int v_j L^*(v_j)\,dv_j = 0$ and $\int v_j^2 L^*(v_j)\,dv_j = 1$. Application of (6.32) gives

L*[9]'^{vj - Uj)]{mj^{vj) - mj{vj)} dvj = Op{l). 
Now, the fact that the function itself and its first two derivatives at Uj of 
= mj{-) — mj{uj) — m'j{uj){- — Uj) — m'-{uj)[- — Ujf' j2 are all zero yields 

x{mj{vj) — nijiuj) — ni'j{uj){vj — uj) — ^m"j{uj){vj — uj)^} dvj 

= 0(1). □ 